The objective of this project is to perform an anlysis of Gun Violence in US in the past decade. The statistics have as their main purpose discovering trends and correlations between several factors and gun violence and to predict based on the data we have the number of incidents in the future.
We have chosen to employ both Python and R as programming languages, the former being used for data processing and modelling purposes, and the latter for the actual visualization of the data gathered.
Three data sets have been used, the primary one which includes the topic specific information, and two supporting sets that contain the population for each state according to the 2019 census and the state codes.
The data source for the project is Kaggle.
R specific libraries import
library(hrbrthemes)
## Registered S3 methods overwritten by 'tibble':
## method from
## format.tbl pillar
## print.tbl pillar
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(reticulate)
library(ggplot2)
library(hrbrthemes)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Python specific libraries import
import pandas as pd
import numpy as np
import datetime as dt
fileName = 'gun-violence.csv'
fullsetDF = pd.read_csv(fileName)
fileName = 'us_population_by_state.csv'
populationDF = pd.read_csv(fileName)
fileName = 'state_codes.csv'
stateCodesDF = pd.read_csv(fileName)
fullsetDF = fullsetDF.drop("incident_id", axis = 1)
fullsetDF = fullsetDF.drop("incident_url_fields_missing", axis = 1)
We have split the date into months and years sine this information will be used later on when studying the changes over the years and if there are any “seasonal changes”.
def splitDates(datesList):
months = []
years = []
for i in range(0, len(datesList)):
value = datesList[i]
splitDate = dt.datetime.strptime(value, "%Y-%m-%d")
monthName = splitDate.strftime('%b')
months.append(monthName)
years.append(splitDate.year)
fullsetDF.insert(1, "MONTH", months)
fullsetDF.insert(2, "YEAR", years)
allDates = fullsetDF['date']
splitDates(allDates)
We would like to see if the type of venue has any impact on the number of incidents that occur. However, there are no clearly defined types and as such, we had to define them ourselves and see which fall into the specific category.The location can be mentioned in several fields, so we will use all of them for our search: location_description, incident_characteristics and notes.
A new field was added called “Location Type”
After this classification, the field “location_description” is no longer useful, so it was deleted. - MEH
locationType = ["Bar/Club", "College", "School", "Apartments", "Park", "Restaurant"]
locations = fullsetDF["location_description"]
characteristics = fullsetDF["incident_characteristics"]
notes = fullsetDF["notes"]
The field incident_characteristics that can be used to find types of incidents. As described previously, we have defined certain categories and searched the text to find in which categories incident fall in.
Unlike the previous example where only one venue was possible, here there can be more than one or more characteristics describing each incident, which required us to define different fields with a YES/NO response (binary categorical variables).
incidentType = ["Home Invasion", "Mass Shooting", "Officer Involved", "Armed Robbery", "Drive-by", "Domestic Violence","Gang"]
fieldNo = fullsetDF.columns.get_loc("incident_characteristics")
matrix = []
for i in range(0, len(incidentType)):
incidentTypeDiscovered = []
for j in range(0, len(characteristics)):
value = characteristics[j]
if type(value) != float:
if incidentType[i].lower() in value.lower():
incidentTypeDiscovered.append("YES")
else:
incidentTypeDiscovered.append("NO")
else:
incidentTypeDiscovered.append("UNKNOWN")
fieldName = incidentType[i]
fullsetDF.insert(fieldNo, fieldName, incidentTypeDiscovered)
The number of guns is not clearly stated. However, we can calculate it based on the number of values in the field “gun_stolen” (if there is no information, the number shown will be 0; it does not signify the absence of a gun but rather lack of sufficient information). A new field is created “No of Guns” which will be positioned right before the “gun_stolen” field.
allInfoGuns = fullsetDF["gun_stolen"]
noOfGuns = []
for i in range (0, len(allInfoGuns)):
arrValuesEntry = []
value = allInfoGuns[i]
if type(value) == float:
# We should check here daca chiar e ok sa punem 0 sau ar trebui N/A or something like that. Am putea sa punem 1 because we assume there was at least 1 gun involved, it seems reasonable to make this assumption
noOfGuns.append(0)
else:
arrValuesEntry = value.split('||')
noOfGuns.append(len(arrValuesEntry))
fieldNo = fullsetDF.columns.get_loc("gun_stolen")
fullsetDF.insert(fieldNo, "NO OF GUNS", noOfGuns)
Field “participant_gender” offers information for all the people involved. We would like to get the actual number and add these fields to the dataframe.
genderParticipants = fullsetDF["participant_gender"]
female = []
male = []
for i in range (0, len(genderParticipants)):
value = genderParticipants[i]
if type(value) != float:
no = value.count("Female")
female.append(no)
no = value.count("Male")
male.append(no)
else:
female.append(0)
male.append(0)
fieldNo = fullsetDF.columns.get_loc("participant_gender")
fullsetDF.insert(fieldNo, "FEMALES", female)
fullsetDF.insert(fieldNo, "MALES", male)
We would also like to make the distinction between the number of adults, teenagers and children involved in the incidents and we will use the same process as above on the field “participant_age_group”.
ageParticipants = fullsetDF["participant_age_group"]
children = []
teenagers = []
adults = []
for i in range (0, len(ageParticipants)):
value = ageParticipants[i]
if type(value) != float:
no = value.count("Adult 18+")
adults.append(no)
no = value.count("Teen 12-17")
teenagers.append(no)
no = value.count("Child 0-11")
children.append(no)
else:
adults.append(0)
teenagers.append(0)
children.append(0)
fieldNo = fullsetDF.columns.get_loc("participant_age_group")
fullsetDF.insert(fieldNo, "ADULTS", adults)
fullsetDF.insert(fieldNo, "TEENAGERS", teenagers)
fullsetDF.insert(fieldNo, "CHILDREN", children)
This is completely useless. Nu ma incanta cu ABSOLUT nimic, mai tare ma incurca to be honest fullsetDF = fullsetDF.rename(columns={ “date”:“DATE”, “state”:“STATE”, “n_killed”:“NO PEOPLE KILLED”, “n_injured”:“NO PEOPLE INJURED”, “gun_stolen”:“STOLEN GUN”, })
#city_or_county address n_killed n_injured incident_url source_url incident_url_fields_missing congressional_district gun_stolen gun_type incident_characteristics latitude location_description longitude n_guns_involved notes participant_age participant_age_group participant_gender participant_name participant_relationship participant_status participant_type sources state_house_district state_senate_district
fullsetDF.to_csv("gun-violence_processed-data.csv")
This section contains all the variables that will be used going further for plotting. Any data selection or processing will be done here.
Calculating total numbers of * incidents that occured * cases that resulted in killings * cases that resulted in injuries in each state, regardless of month or year.
totalPerState = []
totalKillingsPerState = []
totalInjuriesPerState = []
for i in range (0, len(stateCodesDF)):
state = stateCodesDF["State"][i]
dfHelper = fullsetDF[fullsetDF["state"]==state]['state'].copy(deep=True)
totalPerState.append(dfHelper.count().item())
dfHelper = fullsetDF[(fullsetDF["state"] == state) & (fullsetDF['n_killed'] > 0)]['state'].copy(deep=True)
totalKillingsPerState.append(dfHelper.count().item())
dfHelper = fullsetDF[(fullsetDF["state"] == state) & (fullsetDF['n_injured'] > 0)]['state'].copy(deep=True)
totalInjuriesPerState.append(dfHelper.count().item())
stateCodes = stateCodesDF["Code"]
We have defined a function for this type of plot and called it for each of the scenarios.
initial_eda_barplots <- function(all_labels, all_values, x_name, y_name, plot_title, colour){
data=data.frame(name = all_labels, value = all_values)
totalIncidentsBarPlot <- ggplot(data, aes(x=name, y = value)) + geom_bar(stat = "identity", width=0.7, fill = colour) +
xlab(x_name) +
ylab(y_name) +
ggtitle(plot_title)
totalIncidentsBarPlot +
theme(
plot.title = element_text(hjust = 0.5),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(angle = 90, vjust = 0.5)
)
}
Variable: Total number of incidents that occured
We want to see the distribution of the total number of incidents per each state. We will use the totalPerState dataframe and the codes from the stateCodes array.
r_totalPerState <- py$totalPerState
r_labels <- py$stateCodes
initial_eda_barplots(r_labels, r_totalPerState, "State Code", "Number of incidents", "Distribution of incidents per State", "#091CC1")
The general level of incidents can roughly be found somewhere in the interval [0,5000], but we do see some outliers - states have a significantly higher number of gun-involved cases.
Variable: Total number of cases that resulted in killings
r_totalKillingsPerState <- py$totalKillingsPerState
initial_eda_barplots(r_labels, r_totalKillingsPerState, "State Code", "Number of killings", "Distribution of cases that resulted in killings per State", "#652194")
Variable: Total number of cases that resulted in injuries
r_totalInjuriesPerState <- py$totalInjuriesPerState
initial_eda_barplots(r_labels, r_totalInjuriesPerState, "State Code", "Number of injuries", "Distribution of cases that resulted in injuries per State", "#134611")
Outlier values can be noticed in the last two charts as well. Given this observation, it would make sense to realize another chart - a Boxplot - that would specifically highlight this information. Even though boxplots can sometimes be missleading because of the loss of information, our purpose for this scenario is to identify those states that do not fall into the IQR (Interquantile Range), purpose which makes this type of chart appropiate.
Variables: Total number of incidents, firearm deaths and injuries, respectively
r_totalPerState <- py$totalPerState
r_totalKillingsPerState <- py$totalKillingsPerState
r_totalInjuriesPerState <- py$totalInjuriesPerState
p <- boxplot(r_totalPerState, r_totalKillingsPerState, r_totalInjuriesPerState,
main = "Outlier variables - States that have higher criminality",
names = c("Number of incidents", "Number of killings", "Number of injuries"),
col = c("#091CC1","#652194", "#134611")
)
p
## $stats
## [,1] [,2] [,3]
## [1,] 289.0 47.0 47.0
## [2,] 1610.0 234.0 515.0
## [3,] 3201.0 704.0 1120.0
## [4,] 6383.5 1555.5 2770.5
## [5,] 10244.0 3354.0 5870.0
## attr(,"class")
## Number of incidents
## "integer"
##
## $n
## [1] 51 51 51
##
## $conf
## [,1] [,2] [,3]
## [1,] 2144.891 411.6257 620.9836
## [2,] 4257.109 996.3743 1619.0164
##
## $out
## [1] 16306 15029 17556 13577 4972 4350 11023
##
## $group
## [1] 1 1 1 1 2 2 3
##
## $names
## [1] "Number of incidents" "Number of killings" "Number of injuries"
We have decided to further investigate the outlier states so as to better understand the factors that led to an increased number of incidents involving firearms.
The first type of analysis targets the age categories. For this purpose, the data was visualized in separate doughnut charts as seen below.
Calculating the percentage of Child, Teenager and Adult participants for each state.
outlierList = np.array(["Pennsylvania", "California", "Colorado"])
totalsStates = np.zeros(shape=(len(outlierList),3), dtype = float)
categories = np.array(["CHILDREN", "TEENAGERS", "ADULTS"])
for i in range(0,len(outlierList)):
state = outlierList[i]
array = []
dfHelper = fullsetDF[fullsetDF["state"]==state][categories[0]].copy(deep=True)
totalsStates[i][0] = dfHelper.sum().item()
dfHelper = fullsetDF[fullsetDF["state"]==state][categories[1]].copy(deep=True)
totalsStates[i][1] = dfHelper.sum().item()
dfHelper = fullsetDF[fullsetDF["state"]==state][categories[2]].copy(deep=True)
totalsStates[i][2] = dfHelper.sum().item()
# Percentages per state for each group
for i in range(0, len(outlierList)):
total = totalsStates[i][0] + totalsStates[i][1] + totalsStates[i][2]
for j in range(0, 3):
totalsStates[i][j] = float("{:.4f}".format(totalsStates[i][j] / total))*100
donut_plot_labels <- function(all_labels, all_values) {
data = data.frame(name = all_labels, value = all_values)
data$ymax = cumsum(all_values)
data$ymin = c(0, head(data$ymax, n=-1))
data$labelPosition <- (data$ymax + data$ymin) / 2
data$label <- paste0(data$name, ": ", data$value, "%")
ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=name)) +
geom_rect() +
geom_text( x=2, aes(y=labelPosition, label=label, color=name), size=3) +
scale_fill_brewer(palette = "Dark2") +
scale_color_brewer(palette = "Dark2") +
coord_polar(theta="y") +
xlim(c(-1, 4))
}
Variables: Percentage of each age group involved in an incident per state and across the states.
r_categories <- py$categories
matrix <- py$totalsStates
for(row in 1 : nrow(matrix)){
r_totalsStates<- matrix[row,]
print(donut_plot_labels(r_categories, r_totalsStates))
}
We would also like to have an overview of the age groups across the outlier states to see if there are any differences between the states. We could assume that some of the states have a higher percentage of teenagers involved in gun-voilence than others.
# Percentages per group for each state
for j in range(0, 3):
total = totalsStates[0][j] + totalsStates[1][j] + totalsStates[2][j]
for i in range(0, len(outlierList)):
totalsStates[i][j] = float("{:.4f}".format(totalsStates[i][j] / total))*100
totalsStates = totalsStates.transpose()
r_outlierList <- py$outlierList
matrix <- py$totalsStates
for(row in 1 : nrow(matrix)){
r_totalsStates<- matrix[row,]
print(donut_plot_labels(r_outlierList, r_totalsStates))
}
The purpose of the next section is to identify the number of guns used in all the incidents per state.
gunsTotalPerState = []
for i in range (0, len(stateCodesDF)):
state = stateCodesDF["State"][i]
dfHelper = fullsetDF[fullsetDF["state"]==state]["n_guns_involved"].copy(deep=True)
gunsTotalPerState.append(dfHelper.sum().item())
gunsTotalPerState = np.array(gunsTotalPerState)
r_gunsTotalPerState <- py$gunsTotalPerState
The 2 categorical variables that are taken into account for this type of chart are the IncidentType and the States. We have previously split the incidet type into 7 different categories. We would now like to make an analysis of the causes of gun-violence, i.e. how often each type occurs for the individual states.
typeTotalPerState = np.empty(shape=(len(incidentType)*len(stateCodesDF)))
it = 0
for i in range(0, len(incidentType)):
incident = incidentType[i]
for j in range (0, len(stateCodesDF)):
state = stateCodesDF["State"][j]
dfHelper = fullsetDF[(fullsetDF["state"]==state) & (fullsetDF[incident] == "YES")][incident].copy(deep=True)
typeTotalPerState[it] = dfHelper.count()
it = it + 1
r_typeTotalPerState <- py$typeTotalPerState
x <- py$stateCodes
y <- py$incidentType
data <- expand.grid(X=x, Y=y)
data$Z <- r_typeTotalPerState
ggplot(data, aes(X, Y, fill=Z)) +
geom_tile() +
scale_fill_distiller(palette = "Greens") +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5)
)
We want to see the distribution of people involved in an incident, split by gender. We will use two Y axis and …dataframes.
We want
Variables: Total number of incidents, total number of incidents per state => procent
Calculating total number of incidents per state and adding them into a dictionary. The states with incident percentages less than 1% will be categorized as others.
totalPerState = []
incidentsStatesArray = np.array(fullsetDF['state'].to_numpy())
for i in range(0, len(stateCodesDF)):
state = stateCodesDF["State"][i]
totalPerState.append(incidentsStatesArray[incidentsStatesArray == state].shape[0])
DictStateIncident = {"Others": 0}
dfHelper = fullsetDF[fullsetDF["state"]==state]['state'].copy(deep=True)
totalIncidents = len(dfHelper)
incidentsPerStatePercentages = (np.array(totalPerState) / totalIncidents)
for i in range(0,len(stateCodes)):
if incidentsPerStatePercentages[i] > 1:
DictStateIncident[stateCodes[i]] = round(incidentsPerStatePercentages[i], 2)
else:
DictStateIncident["Others"] = DictStateIncident["Others"] + round(incidentsPerStatePercentages[i], 2)
statesForDonut = np.array(list(DictStateIncident.keys()))
valuesForDonut = np.array(list(DictStateIncident.values()))
donut_plot <- function(all_labels, all_values) {
data = data.frame(name = all_labels, value = all_values)
data$ymax = cumsum(all_values)
data$ymin = c(0, head(data$ymax, n=-1))
ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=name)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(2, 4))
}
r_labels <- py$statesForDonut
r_percentagesPerState <- py$valuesForDonut
donut_plot(r_labels, r_percentagesPerState)
Calculating the number of incidents per region: NE, NW, SW, SE.
We consider the center position the geographic center of the United States. This is a point approximately 32 km north of Belle Fourche, South Dakota at LAT. 39°50’ LONG. −98°35’
DictIncidentsPerRegion = {"NE": 0, "NW": 0, "SW": 0, "SE": 0}
incidentsCoordinates = fullsetDF[["latitude", "longitude"]]
for i in range(0, len(incidentsCoordinates)):
if incidentsCoordinates["latitude"][i] > 39 and incidentsCoordinates["longitude"][i] > -98:
DictIncidentsPerRegion["NE"] += 1
if incidentsCoordinates["latitude"][i] > 39 and incidentsCoordinates["longitude"][i] < -98:
DictIncidentsPerRegion["NW"] += 1
if incidentsCoordinates["latitude"][i] < 39 and incidentsCoordinates["longitude"][i] < -98:
DictIncidentsPerRegion["SW"] += 1
if incidentsCoordinates["latitude"][i] < 39 and incidentsCoordinates["longitude"][i] > -98:
DictIncidentsPerRegion["SE"] += 1
totalNoIncidents = sum(DictIncidentsPerRegion.values())
regions = np.array(list(DictIncidentsPerRegion.keys()))
incidentsPerRegions = np.round(np.array((np.array(list(DictIncidentsPerRegion.values())) / totalNoIncidents) * 100), 2)
r_regions <- py$regions
r_incidentsPerRegion <- py$incidentsPerRegions
donut_plot_labels(r_regions, r_incidentsPerRegion)
In this section we show you the top 25 and the buttom 25 news websites that report crimes. In this way we could decide which online newspaper is the most and the least trustful.
sourcesPerIncident = fullsetDF[["date", "sources"]]
DictSourcesAndIncidents = {"pittsburgh.cbslocal.com" : 0}
for i in range (0, len(sourcesPerIncident["sources"])):
try:
sourcesArray = sourcesPerIncident["sources"][i].split("||")
except:
continue
for source in sourcesArray:
URI = source.strip().partition("//")[2]
URL = URI.partition("/")[0]
if URL in DictSourcesAndIncidents:
DictSourcesAndIncidents[URL] += 1
else:
DictSourcesAndIncidents[URL] = 0
lessIncidentsPerSource = dict(filter(lambda elem: elem[1] != 0, DictSourcesAndIncidents.items()))
sort_by_value = lambda DictSourcesAndIncidents: DictSourcesAndIncidents[1]
sortedSourcesDict = sorted(DictSourcesAndIncidents.items(), key = sort_by_value, reverse=True)
sourcesName = []
sourceNews = []
for x in list(sortedSourcesDict)[0:25]:
sourcesName.append(x[0])
sourceNews.append(x[1])
sortedSourcesLeastTrustful = sorted(lessIncidentsPerSource.items(), key = sort_by_value)
sourcesLeastTrustful = []
appearancesLeastTrusful = []
for x in list(sortedSourcesLeastTrustful)[0:25]:
sourcesLeastTrustful.append(x[0])
appearancesLeastTrusful.append(x[1])
library(plotly)
lollipop_function <- function(sources, appearances, title) {
data <- data.frame(x = sources, y = appearances)
p <- ggplot(data, aes(x=x, y=y)) +
geom_segment( aes(x=reorder(x, y), xend=x, y=0, yend=y), color="grey") +
geom_point( color="orange", size=4, fill=alpha("orange", 0.3), alpha=0.7, shape=21, stroke=2) +
theme_light() +
coord_flip() +
ggtitle(title) +
scale_size(range=c(2,12), guide="none") +
theme(
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(size = 12)
) +
xlab("") +
ylab("Number of incidents reported along the whole period")
ggplotly(p)
}
r_source <- py$sourcesName
r_news <- py$sourceNews
lollipop_function(r_source, r_news, "The most trustful online newspeper that report gun-violence incidents")
r_source <- py$sourcesLeastTrustful
r_news <- py$appearancesLeastTrusful
lollipop_function(r_source, r_news, "The least trustful online newspeper that report gun-violence incidents")
Chart Type: We want to see the level of crime for each state proportional to the size of the population. We have previously calcuated the respective percentages and they will be ploted below, also adding tooltips to offer more details for each state.